Corpus-Induced Corpus Clean-up

Author

  • Martin Reynaert
Abstract

We explore the feasibility of using only unsupervised means to identify non-words, i.e. typos, in a frequency list derived from a large corpus of Dutch, and to distinguish these non-words from real words in the language. The system we built and evaluate in this paper is called CICCL, which stands for ‘Corpus-Induced Corpus Clean-up’. CICCL is based primarily on the anagram-key hashing algorithm introduced by Reynaert (2004). The core correction mechanism is a simple and effective method that translates the characters making up a word into a large natural number in such a way that all anagrams, i.e. all words composed of precisely the same characters, are allocated the same natural number. In effect, this constitutes a novel approximate string matching algorithm for indexed text search: by simple addition, subtraction, or a combination of both, iterating over the alphabet retrieves all variants within reach of the range of numerical values defined for the alphabet. CICCL’s input consists primarily of corpus-derived frequency lists, from which it derives valuable morphological and compounding information by performing frequency counts over the substrings of the words. These counts are then used to perform decompounding, as well as to distinguish words that are most likely correctly spelled from typos.
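The anagram-key idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the hashing function, so the sketch assumes a key formed by summing each character's code point raised to the fifth power; the exponent `POWER` and the helper names `anagram_key`, `build_index`, and `substitution_variants` are illustrative assumptions, not CICCL's actual implementation.

```python
from collections import defaultdict

POWER = 5  # assumption: exponent chosen for illustration, not stated in the abstract


def anagram_key(word: str) -> int:
    """Sum of code points raised to POWER.

    Addition is order-independent, so all anagrams share one key.
    """
    return sum(ord(ch) ** POWER for ch in word)


def build_index(words):
    """Map each anagram key to the set of lexicon words that produce it."""
    index = defaultdict(set)
    for w in words:
        index[anagram_key(w)].add(w)
    return index


def substitution_variants(word, index, alphabet):
    """Find indexed words whose characters differ from `word` by one substitution.

    Only key arithmetic is used: subtract the value of a character in the
    word, add the value of a character from the alphabet, look up the result.
    """
    key = anagram_key(word)
    found = set()
    for old in set(word):
        for new in alphabet:
            if new == old:
                continue
            candidate_key = key - ord(old) ** POWER + ord(new) ** POWER
            found |= index.get(candidate_key, set())
    return found


lexicon = ["tree", "free", "reef", "three"]
index = build_index(lexicon)
alphabet = "abcdefghijklmnopqrstuvwxyz"
variants = substitution_variants("tree", index, alphabet)
```

Here `variants` contains both "free" and "reef", because the lookup matches on the bag of characters rather than their order; a practical system would presumably filter such candidates further, for example by edit distance. Insertions and deletions follow the same pattern with addition only or subtraction only.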


Similar resources

A Corpus-Based Study of the Lexical Make-up of Applied Linguistics Article Abstracts

This paper reports results from a corpus-based study that explored the frequency of words in the abstracts of applied linguistics journal articles. The abstracts of major articles in leading applied linguistics journals, published since 2005 up to November 2001, were analyzed using software modules from the Compleat Lexical Tutor. The output includes a list of the most frequent content words, list...


TICCLops: Text-Induced Corpus Clean-up as online processing system

We present the ‘online processing system’ version of Text-Induced Corpus Clean-up, a web service and application open for use to researchers. The system has over the past years been developed to provide mainly OCR error post-correction, but can just as fruitfully be employed to automatically correct texts for spelling errors, or to transcribe texts in an older spelling into the modern variant o...


Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or TICCL (pronounced ‘tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...


Synergy of Nederlab and @PhilosTEI: diachronic and multilingual Text-Induced Corpus Clean-up

In two concurrent projects in the Netherlands we are further developing TICCL or Text-Induced Corpus Clean-up. In project Nederlab TICCL is set to work on diachronic Dutch text. To this end it has been equipped with the largest diachronic lexicon and a historical name list developed at the Institute for Dutch Lexicology or INL. In project @PhilosTEI TICCL will be set to work on a fair range of ...


Protective effects of erythropoietin against cuprizone-induced oxidative stress and demyelination in the mouse corpus callosum

Objective(s): Increasing evidence in both experimental and clinical studies suggests that oxidative stress plays a major role in the pathogenesis of multiple sclerosis. The aim of the present work is to investigate the protective effects of erythropoietin against cuprizone-induced oxidative stress. Materials and Methods: Adult male C57BL/6J mice were fed a chow containing 0.2 % cuprizone for 6 ...




Publication date: 2006